PerfCount
=========

PerfCount is a module that provides easy access to the performance counters present on most (if not all) ARMv5+ CPUs.

Basic usage:

1. SWI PerfCount_DescribeEvents to check which events are available on the host system
2. SWI PerfCount_CreateContext to create a context to track the indicated events
3. SWI PerfCount_Start to start measuring
4. SWI PerfCount_Stop to stop measuring
5. SWI PerfCount_Get to read the counter values
6. SWI PerfCount_DestroyContext to destroy the context
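
For C programs it can be handy to have the SWI numbers as constants. The sketch below only restates the numbers listed in the "SWIs" section; the enumerator and array names are illustrative assumptions, not something PerfCount ships. The SWIs themselves would be issued through whatever veneer your toolchain provides, which is left out so that the fragment compiles on any host.

```c
#include <stdint.h>

/* SWI numbers as listed in the "SWIs" section of this document.
   The enumerator names are illustrative only. */
enum {
    PerfCount_DescribeEvents = 0x59F80,
    PerfCount_CreateContext  = 0x59F81,
    PerfCount_DestroyContext = 0x59F82,
    PerfCount_CurrentContext = 0x59F83,
    PerfCount_Start          = 0x59F84,
    PerfCount_Stop           = 0x59F85,
    PerfCount_Get            = 0x59F86,
    PerfCount_Sync           = 0x59F87,
    PerfCount_Pause          = 0x59F88,
    PerfCount_Unpause        = 0x59F89,
    PerfCount_Reset          = 0x59F8A,
    PerfCount_Status         = 0x59F8B,
    PerfCount_Multipass      = 0x59F8C
};

/* The six calls of the basic usage sequence above, in order. */
static const uint32_t perfcount_basic_sequence[6] = {
    PerfCount_DescribeEvents,
    PerfCount_CreateContext,
    PerfCount_Start,
    PerfCount_Stop,
    PerfCount_Get,
    PerfCount_DestroyContext
};
```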

Features:

* Reporting of which events are available, using a common ID scheme to provide some consistency between different CPUs
* Support for multiple contexts, each measuring different events. Note: Currently, only one context can be active at a time.
* Automatic & manual multipass/multiplexing system to allow you to gather data for as many events as you want, bypassing the limits on how many events the CPU can track at once
* 64-bit counter width to avoid overflow
* Optional per-context Wimp task association to pause/resume monitoring depending on whether a specific task is active (e.g. for system-wide profiling)
* Filtering by privileged & unprivileged CPU mode (if supported by the CPU)


SWIs
----

PerfCount_DescribeEvents          &59F80

  Out: R0 = number of events available
       R1 = number of events which can be active at once
       R2 -> list of event IDs (one word per ID)
       R3 -> list of event names (pointers to null-terminated strings)
       R4 = CPU ID (CP15 main ID register)
       R5 -> CPU name string
       R6 -> list of internal event IDs (one word per ID)

Returns information on the available events, and the system as a whole. Entries in R2, R3 and R6 all correspond to each other, i.e. element N of R2 corresponds to element N of R3 & R6.

R2 contains the event IDs supported by the CPU, using PerfCount's ID scheme. See the "Event IDs" section for more info.

R3 contains the event names/mnemonics. Where possible, these match the mnemonics used in ARM's documentation (mainly the PME section of the ARMv7/v8 Architecture Reference Manuals).

R6 contains the "internal event IDs" that correspond to the PerfCount event IDs. These are the actual event IDs used by the CPU; you can cross-reference them against the relevant CPU manual to find more information about the events. The only exception to this rule is the XScale CPU cycles event, which PerfCount assigns a fake internal ID of -1.

R1 indicates the maximum number of events that can be active at once. Note, however, that in some cases it may be impossible to have this many events active at once. For example, privileged-only and unprivileged-only events require PerfCount to use an extra performance counter to track the number of CPU cycles for which the "in privileged mode" or "in unprivileged mode" condition holds, so that PerfCount_Get can return an accurate value in R2,R3.
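
Because the lists in R2, R3 and R6 are parallel, one index is enough to go from an event name to its PerfCount ID and its internal ID. The following is a minimal sketch of such a lookup. The `event_table` struct and `find_event` function are hypothetical helpers that hold values already copied out of the registers; they are not part of PerfCount.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the values returned by
   PerfCount_DescribeEvents in R0, R2, R3 and R6. */
typedef struct {
    uint32_t           count;        /* R0: number of events available */
    const uint32_t    *ids;          /* R2: PerfCount event IDs */
    const char *const *names;        /* R3: event names */
    const uint32_t    *internal_ids; /* R6: CPU-specific event IDs */
} event_table;

/* Return the index of the named event, or -1 if it is not in the table.
   The same index is valid for ids[], names[] and internal_ids[]. */
static int find_event(const event_table *t, const char *name)
{
    for (uint32_t i = 0; i < t->count; i++) {
        if (strcmp(t->names[i], name) == 0) {
            return (int)i;
        }
    }
    return -1;
}
```

A caller would typically look up a mnemonic such as "CPU_CYCLES", check for -1, and then pass `ids[index]` on to PerfCount_CreateContext.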


PerfCount_CreateContext           &59F81

  In: R0 = Number of events to monitor
      R1 -> event ID list (1 word each)
      R2 = Flags
           bit 0: 0 = must be single-shot mode
                  1 = multi-pass mode acceptable
           bit 1: 0 = manual multi-pass
                  1 = automatic multi-pass
           bits 2+: reserved (zero)
      R3 = Task handle to bind to, or 0 for none

  Out: R0 = context handle

Creates a profiling context.

The ID list pointed to by R1 can be a temporary list; PerfCount will make a copy of it.

The flags in the bottom two bits of R2 control how PerfCount behaves when there are too many events for the CPU to track them all at once.

* If bit 0 is clear, context creation will fail with an error
* If bit 0 is set and bit 1 is clear, manual multi-pass mode is used: You must call PerfCount_Multipass to cycle to the next set of counters
* If bit 0 is set and bit 1 is set, automatic multi-pass mode is used. PerfCount will automatically cycle through the different events by using the 100Hz TickerV interrupt.
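
The three cases above can be captured in a pair of small helpers. This is only a sketch: the macro, enum and function names are made up for illustration, and treating "bit 0 clear" as single-shot regardless of bit 1 is an assumption drawn from the first bullet.

```c
#include <stdint.h>

/* Flag bits for R2 of PerfCount_CreateContext (names are illustrative). */
#define CTX_FLAG_MULTIPASS_OK   (1u << 0)
#define CTX_FLAG_AUTO_MULTIPASS (1u << 1)
#define CTX_FLAG_RESERVED       (~(CTX_FLAG_MULTIPASS_OK | CTX_FLAG_AUTO_MULTIPASS))

typedef enum {
    MODE_SINGLE_SHOT,      /* creation fails if the events do not fit */
    MODE_MANUAL_MULTIPASS, /* caller cycles with PerfCount_Multipass */
    MODE_AUTO_MULTIPASS    /* PerfCount cycles on TickerV */
} ctx_mode;

/* Build the flags word for the requested mode. */
static uint32_t ctx_flags(ctx_mode mode)
{
    switch (mode) {
    case MODE_MANUAL_MULTIPASS:
        return CTX_FLAG_MULTIPASS_OK;
    case MODE_AUTO_MULTIPASS:
        return CTX_FLAG_MULTIPASS_OK | CTX_FLAG_AUTO_MULTIPASS;
    default:
        return 0;
    }
}

/* Interpret a flags word; bit 1 is ignored when bit 0 is clear (assumption). */
static ctx_mode ctx_mode_from_flags(uint32_t flags)
{
    if ((flags & CTX_FLAG_MULTIPASS_OK) == 0) {
        return MODE_SINGLE_SHOT;
    }
    return (flags & CTX_FLAG_AUTO_MULTIPASS) ? MODE_AUTO_MULTIPASS
                                             : MODE_MANUAL_MULTIPASS;
}

/* Reserved bits (2 and up) must be zero. */
static int ctx_flags_valid(uint32_t flags)
{
    return (flags & CTX_FLAG_RESERVED) == 0;
}
```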

The created context will be unpaused, but inactive; PerfCount_Start must be called to activate it.


PerfCount_DestroyContext          &59F82

  In: R0 = context handle

Destroys the given profiling context.


PerfCount_CurrentContext          &59F83

  Out: R0 = context handle

Returns the handle of the active context.


PerfCount_Start                   &59F84

  In: R0 = context handle

Starts profiling. Returns an error if a conflicting context is already active. If the context is paused, this is effectively a no-op.


PerfCount_Stop                    &59F85

  In: R0 = context handle

Stops profiling. Also synchronises counters, like PerfCount_Sync.


PerfCount_Get                     &59F86

  In: R0 = context handle
      R1 = event ID

  Out: R0,R1 = 64-bit event count (R0 = low bits, R1 = high bits)
       R2,R3 = 64-bit number of CPU cycles the event has been monitored for

Returns buffered counter values. Buffered values are used so that you can safely read the values for an active context without having to worry about the values changing while you're reading them.

For each event, PerfCount also keeps track of how many CPU cycles that event was being monitored for. This is necessary to allow you to correctly compare values for different events when multi-pass mode is in use.
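
As an illustration of how the two register pairs might be used, the sketch below joins the low and high words into 64-bit values and expresses a count as a rate per 1000 monitored cycles. In multi-pass mode each event is only monitored for part of the time, so rates are easier to compare than raw counts. The function names are hypothetical, and choosing "per 1000 cycles" as the unit is an arbitrary assumption.

```c
#include <stdint.h>

/* Join a low/high register pair (e.g. R0,R1 or R2,R3) into one 64-bit value. */
static uint64_t join64(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Events per 1000 monitored CPU cycles.
   Returns 0 if the event has not been monitored yet. */
static double events_per_kilocycle(uint64_t count, uint64_t cycles)
{
    if (cycles == 0) {
        return 0.0;
    }
    return (double)count * 1000.0 / (double)cycles;
}
```

For example, a count of 50 over 100000 monitored cycles and a count of 100 over 200000 monitored cycles both work out to 0.5 events per 1000 cycles, even though the raw counts differ.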


PerfCount_Sync                    &59F87

  In: R0 = context handle

Synchronises counters, to allow PerfCount_Get to return up-to-date values for a running context.


PerfCount_Pause                   &59F88

  In: R0 = context handle

Pauses the given context. An internal count is maintained of how many times the context has been paused, to allow for nested calls. Unlike PerfCount_Stop, this will not synchronise the buffered counts available via PerfCount_Get.


PerfCount_Unpause                 &59F89

  In: R0 = context handle

Decrements the pause count by one, potentially resuming the context.


PerfCount_Reset                   &59F8A

  In: R0 = context handle

Resets all counters (buffered & actual) to zero.


PerfCount_Status                  &59F8B

  In: R0 = context handle

 Out: R0,R1 = Number of cycles the context has been active for
              (minimum of all cycle counts, for multi-pass contexts)
              Based on buffered values.
      R2 = bit 0: 1 = context is stopped at user request (PerfCount_Stop)
           bit 1: 1 = context is stopped because task is inactive
           bits 2-15: Zero (reserved for future stop reasons)
           bits 16+: pause count

Returns information about a context.
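
The R2 status word can be unpacked with a few shifts and masks. The following is a minimal sketch of that decoding; the `ctx_status` struct and `decode_status` function are illustrative names, not part of the module.

```c
#include <stdint.h>

/* Decoded form of the R2 value returned by PerfCount_Status. */
typedef struct {
    int      stopped_by_user;       /* bit 0: stopped via PerfCount_Stop */
    int      stopped_task_inactive; /* bit 1: bound task is not active */
    uint32_t pause_count;           /* bits 16+: nested pause count */
} ctx_status;

static ctx_status decode_status(uint32_t r2)
{
    ctx_status s;
    s.stopped_by_user       = (int)(r2 & 1u);
    s.stopped_task_inactive = (int)((r2 >> 1) & 1u);
    s.pause_count           = r2 >> 16;
    return s;
}
```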


PerfCount_Multipass               &59F8C

  In: R0 = context handle

 Out: R0 = number of passes completed

Used to cycle to the next set of events in manual multi-pass mode. In automatic mode it will only return the number of passes.


Event IDs
---------

Event IDs have the following format:

  Bits 0-15   Event number
  Bits 16-27  Reserved, 0
  Bit 28      1 if count is approximate, 0 if accurate
  Bit 29      1 if count is level-based, 0 if edge-based
  Bit 30      1 if event does not count in unprivileged CPU modes
  Bit 31      1 if event does not count in privileged CPU modes

Bit 29 will be set if the counter increments for every cycle in which the event condition holds, e.g. "number of cycles CPU is stalled" (bit 29 set), versus "number of times the CPU stalls" (bit 29 clear).
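
A program that receives event IDs from PerfCount_DescribeEvents may want to split them into their fields. The sketch below does this with masks that follow the layout above; the macro and function names are illustrative assumptions.

```c
#include <stdint.h>

/* Field masks for PerfCount event IDs (names are illustrative). */
#define EVENT_NUMBER_MASK 0xFFFFu     /* bits 0-15 */
#define EVENT_APPROXIMATE (1u << 28)  /* count is approximate */
#define EVENT_LEVEL_BASED (1u << 29)  /* counts cycles, not occurrences */
#define EVENT_NO_UNPRIV   (1u << 30)  /* not counted in unprivileged modes */
#define EVENT_NO_PRIV     (1u << 31)  /* not counted in privileged modes */

static uint32_t event_number(uint32_t id)
{
    return id & EVENT_NUMBER_MASK;
}

static int event_is_approximate(uint32_t id)
{
    return (id & EVENT_APPROXIMATE) != 0;
}

static int event_is_level_based(uint32_t id)
{
    return (id & EVENT_LEVEL_BASED) != 0;
}

/* True if the event counts only while the CPU is in a privileged mode. */
static int event_is_privileged_only(uint32_t id)
{
    return (id & EVENT_NO_UNPRIV) != 0 && (id & EVENT_NO_PRIV) == 0;
}

/* True if the event counts only while the CPU is in an unprivileged mode. */
static int event_is_unprivileged_only(uint32_t id)
{
    return (id & EVENT_NO_PRIV) != 0 && (id & EVENT_NO_UNPRIV) == 0;
}
```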

Bits 0-15 are the event number:

No. Name                     Platforms
  0 L1I_CACHE_REFILL         PME (&01)
  1 L1I_TLB_REFILL           PME (&02)
  2 L1D_CACHE_REFILL         PME (&03)
  3 L1D_CACHE                PME (&04), XScale (&0A)
  4 L1D_TLB_REFILL           PME (&05)
  5 LD_RETIRED               PME (&06)
  6 ST_RETIRED               PME (&07)
  7 INST_RETIRED             PME (&08), XScale (&07), ARM11 (&07)
  8 EXC_TAKEN                PME (&09)
  9 EXC_RETURN               PME (&0A)
 10 CID_WRITE_RETIRED        PME (&0B)
 11 PC_WRITE_RETIRED         PME (&0C)
 12 BR_IMMED_RETIRED         PME (&0D), ARM11 (&05)
 13 BR_RETURN_RETIRED        PME (&0E), ARM11 (&24), Cortex-A9 (&6E)
 14 UNALIGNED_LDST_RETIRED   PME (&0F)
 15 BR_MIS_PRED              PME (&10), XScale (&06), ARM11 (&06)
 16 CPU_CYCLES               PME (&11), XScale (-1), ARM11 (&FF)
 17 BR_PRED                  PME (&12)
 18 MEM_ACCESS               PME (&13)
 19 L1I_CACHE                PME (&14), Cortex-A8 (&50)
 20 L1D_CACHE_WB             PME (&15), XScale (&0C), ARM11 (&0C)
 21 L2D_CACHE                PME (&16), Cortex-A8 (&43)
 22 L2D_CACHE_REFILL         PME (&17)
 23 L2D_CACHE_WB             PME (&18)
 24 BUS_ACCESS               PME (&19)
 25 MEMORY_ERROR             PME (&1A)
 26 INST_SPEC                PME (&1B), Cortex-A9 (&68)
 27 TTBR_WRITE_RETIRED       PME (&1C)
 28 BUS_CYCLES               PME (&1D)
 29 L1D_CACHE_ALLOCATE       PME (&1F)
 30 L2D_CACHE_ALLOCATE       PME (&20)
 31 BR_RETIRED               PME (&21), XScale (&05)
 32 BR_MIS_PRED_RETIRED      PME (&22)
 33 STALL_FRONTEND           PME (&23), ARM11 (&01)
 34 STALL_BACKEND            PME (&24), Cortex-A9 (&66)
 35 L1D_TLB                  PME (&25)
 36 L1I_TLB                  PME (&26)
 37 L2I_CACHE                PME (&27)
 38 L2I_CACHE_REFILL         PME (&28)
 39 L3D_CACHE_ALLOCATE       PME (&29)
 40 L3D_CACHE_REFILL         PME (&2A)
 41 L3D_CACHE                PME (&2B)
 42 L3D_CACHE_WB             PME (&2C)
 43 L2D_TLB_REFILL           PME (&2D)
 44 L2I_TLB_REFILL           PME (&2E)
 45 L2D_TLB                  PME (&2F)
 46 L2I_TLB                  PME (&30)
 47 REMOTE_ACCESS            PME (&31)
 48 LL_CACHE                 PME (&32)
 49 LL_CACHE_MISS            PME (&33), Cortex-A8 (&44)
 50 DTLB_WALK                PME (&34)
 51 ITLB_WALK                PME (&35)
 52 LL_CACHE_RD              PME (&36)
 53 LL_CACHE_MISS_RD         PME (&37)
 54 REMOTE_ACCESS_RD         PME (&38)
 55 L1D_CACHE_LMISS_RD       PME (&39)
 56 OP_RETIRED               PME (&3A)
 57 OP_SPEC                  PME (&3B)
 58 STALL                    PME (&3C)
 59 STALL_SLOT_BACKEND       PME (&3D)
 60 STALL_SLOT_FRONTEND      PME (&3E)
 61 STALL_SLOT               PME (&3F)
 62 ICacheMiss               XScale (&00), ARM11 (&00)
 63 ICacheEmpty              XScale (&01)
 64 Stall_DataDependency     XScale (&02), ARM11 (&02)
 65 ITLBMiss                 XScale (&03)
 66 DTLBMiss                 XScale (&04)
 67 Stall_DCacheFull         XScale (&08, &09)
 68 DCacheMiss               XScale (&0B), ARM11 (&0B)
 69 PCWrite                  XScale (&0D), ARM11 (&0D)
 70 MispredictedReturn       ARM11 (&26), Cortex-A8 (&51)
 71 PredictedReturn          ARM11 (&25)
 72 ProcedureCall            ARM11 (&23)
 73 WBDrain                  ARM11 (&12)
 74 Stall_LSU                ARM11 (&11)
 75 ExternalData             ARM11 (&10)
 76 MainTLBMiss              ARM11 (&0F)
 77 NonSeqData               ARM11 (&0A)
 78 NonSeqData_Cacheable     ARM11 (&09)
 79 MicroDTLBMiss            ARM11 (&04)
 80 MicroITLBMiss            ARM11 (&03)
 81 WBFull                   Cortex-A8 (&40)
 82 L2Merged                 Cortex-A8 (&41)
 83 L2Store                  Cortex-A8 (&42)
 84 AXIRead                  Cortex-A8 (&45)
 85 AXIWrite                 Cortex-A8 (&46)
 86 MemReplay                Cortex-A8 (&47)
 87 UnalignedMemReplay       Cortex-A8 (&48)
 88 L1DHashMiss              Cortex-A8 (&49)
 89 L1IHashMiss              Cortex-A8 (&4A)
 90 L1DPageColouring         Cortex-A8 (&4B)
 91 NEON_L1Hit               Cortex-A8 (&4C)
 92 NEON_L1Access            Cortex-A8 (&4D)
 93 NEON_L2Access            Cortex-A8 (&4E)
 94 NEON_L2Hit               Cortex-A8 (&4F)
 95 BranchMispredicted       Cortex-A8 (&52)
 96 BranchPredictedAsTaken   Cortex-A8 (&53)
 97 PredictableBranchTaken   Cortex-A8 (&54)
 98 MicroOpsIssued           Cortex-A8 (&55)
 99 NoInstruction            Cortex-A8 (&56)
100 InstructionIssued        Cortex-A8 (&57)
101 Stall_NEON_MRC           Cortex-A8 (&58)
102 Stall_NEON_Full          Cortex-A8 (&59)
103 CPU_NEON_Working         Cortex-A8 (&5A)
104 JavaExecuted             Cortex-A9 (&40)
105 SWJavaExecuted           Cortex-A9 (&41)
106 JavaBackwardsBranch      Cortex-A9 (&42)
107 CoherentLinefillMiss     Cortex-A9 (&50)
108 CoherentLinefillHit      Cortex-A9 (&51)
109 Stall_ICacheLineFill     Cortex-A9 (&60)
110 Stall_DCacheLineFill     Cortex-A9 (&61)
111 Stall_MainTLBMiss        Cortex-A9 (&62)
112 STREX_Pass               Cortex-A9 (&63)
113 STREX_Fail               Cortex-A9 (&64)
114 DCache_Eviction          Cortex-A9 (&65)
115 Issue_Empty              Cortex-A9 (&67)
116 DataLinefill             Cortex-A9 (&69)
117 PrefetchLinefill         Cortex-A9 (&6A)
118 PrefetchHit              Cortex-A9 (&6B)
119 MainUnitInstrs           Cortex-A9 (&70)
120 SecUnitInstrs            Cortex-A9 (&71)
121 LSInstrs                 Cortex-A9 (&72)
122 FPInstrs                 Cortex-A9 (&73)
123 NEONInstrs               Cortex-A9 (&74)
124 Stall_PLD                Cortex-A9 (&80)
125 Stall_Write              Cortex-A9 (&81)
126 Stall_MainTLBMiss_Instr  Cortex-A9 (&82)
127 Stall_MainTLBMiss_Data   Cortex-A9 (&83)
128 Stall_MicroTLBMiss_Instr Cortex-A9 (&84)
129 Stall_MicroTLBMiss_Data  Cortex-A9 (&85)
130 Stall_DMB                Cortex-A9 (&86)
131 Clock_Integer            Cortex-A9 (&8A)
132 Clock_DataEngine         Cortex-A9 (&8B)
133 Clock_NEON               Cortex-A9 (&8C)
134 TLBAlloc_Instr           Cortex-A9 (&8D)
135 TLBAlloc_Data            Cortex-A9 (&8E)
136 ISB                      Cortex-A9 (&90)
137 DSB                      Cortex-A9 (&91)
138 DMB                      Cortex-A9 (&92)
139 ExtInterrupt             Cortex-A9 (&93)
140 PLE_Line_Completed       Cortex-A9 (&A0)
141 PLE_Line_Skipped         Cortex-A9 (&A1)
142 PLE_FIFO_Flush           Cortex-A9 (&A2)
143 PLE_Req_Completed        Cortex-A9 (&A3)
144 PLE_FIFO_Overflow        Cortex-A9 (&A4)
145 PLE_Programmed           Cortex-A9 (&A5)

Key:

* PME - Event corresponds to one of the standard events in the Performance Monitor Extension specification. CPUs which implement the PME will support a subset of these events.
* XScale - Event is XScale-specific/supported by XScale.
* ARM11 - Event is ARM11-specific/supported by ARM11.
* Cortex-A8 - Event is Cortex-A8 specific/supported by Cortex-A8. Note that several of the standard PME events will also be supported, but to save space I haven't listed them here.
* Cortex-A9 - Event is Cortex-A9 specific/supported by Cortex-A9. Note that several of the standard PME events will also be supported, but to save space I haven't listed them here.

The numbers in brackets indicate the internal event ID(s). E.g. two numbers are listed for Stall_DCacheFull because the event is available in both edge- and level-based form.
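
Reading down the PME column of the table, the internal IDs follow a simple pattern: event numbers 0-28 map to &01-&1D, and event numbers 29-61 map to &1F-&3F (internal ID &1E is skipped). The sketch below encodes that observation. It is derived purely from the table in this document, the function name is hypothetical, and it says nothing about how PerfCount performs the mapping internally or about which PME events a given CPU actually implements.

```c
#include <stdint.h>

/* Map a PerfCount event number to the PME internal event ID shown in the
   table above, or return -1 if the event number has no PME entry. */
static int pme_internal_id(uint32_t event_number)
{
    if (event_number <= 28) {
        return (int)event_number + 1; /* 0..28  -> &01..&1D */
    }
    if (event_number <= 61) {
        return (int)event_number + 2; /* 29..61 -> &1F..&3F */
    }
    return -1;                        /* CPU-specific events only */
}
```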


Known issues / future plans
---------------------------

* Allow multiple contexts to be active at once if the combined set of events they want to monitor can be tracked concurrently by the hardware (or if they'll allow automatic multiplexing to be used to cycle through the events)
* Add support for the Cortex-A7, A15, A53 & A72-specific events
* Wimp task association and automatic multiplexing don't work very well for monitoring tasks which are only active for short periods of time (very low chance of the TickerV interrupt occurring while the task is active); triggering a multipass cycle each time the task is paged out might solve that
* Better privileged/unprivileged filtering (currently this is handled by using one of the event registers to track the priv/unpriv cycle count, but this can also be done by reconfiguring the main cycle count register, which would allow more events to be tracked at once)


Reference material
------------------

ARMv7-AR Architecture Reference Manual
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406c/index.html

ARMv8-A Architecture Reference Manual
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.a/index.html

Intel XScale Core Developer's Manual
http://download.intel.com/design/intelxscale/27347302.pdf

ARM1176JZF-S Technical Reference Manual
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0301h/index.html

Cortex-A8 Technical Reference Manual
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/index.html

Cortex-A9 Technical Reference Manual
http://infocenter.arm.com/help/topic/com.arm.doc.100511_0401_10_en/index.html


History
-------

0.01 - 20/6/2020
- First release


Legal
-----

Copyright (c) 2020, Jeffrey Lee
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.


Contact info
------------

Jeffrey Lee
me@phlamethrower.co.uk
http://www.phlamethrower.co.uk/

